{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tic Tac Toe\n",
    "\n",
    "You can also open this on [Google Colab](https://colab.research.google.com/github/camel-ai/camel/blob/master/docs/cookbooks/loong/multi_step_rl.ipynb)\n",
    "\n",
    "In this cookbook, I want to show how Multi-Step environments work in CAMEL. Our RL modules were built to mimic OpenAI Gym, so if you're familiar with Gym's interface, you'll feel right at home.\n",
    "\n",
    "We will use the Tic-Tac-Toe environment as an example to show the lifecycle of an environment.\n",
    "\n",
    "The Tic-Tac-Toe environment can be used to evaluate agents, generate synthetic data for distillation, or train an agent to play the game.\n",
    "\n",
    "First, we need to initialize our environment and set it up. Then we can call `reset` to get our initial observation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's install the CAMEL package with all its dependencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install camel-ai[all]==0.2.46"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Initial Observation:\n",
      "\n",
      "You are playing Tic Tac Toe with standard rules.\n",
      "You are the player with X.\n",
      "Choose a number between 1 and 9 to place an X.\n",
      "This is the current state of the board:\n",
      "1 | 2 | 3\n",
      "---------\n",
      "4 | 5 | 6\n",
      "---------\n",
      "7 | 8 | 9\n",
      "Each number that you can see is still an empty field that you can place your 'X' in. Please end your response with <Action> [a number from 1 to 9]\n"
     ]
    }
   ],
   "source": [
    "import asyncio\n",
    "from camel.environments.models import Action\n",
    "from camel.environments.tic_tac_toe import TicTacToeEnv, Opponent\n",
    "\n",
    "# we can choose the playstyle of our opponent to be either 'random' or 'optimal' (computed using minimax)\n",
    "opp = Opponent(play_style=\"random\")\n",
    "env = TicTacToeEnv(opponent=opp)\n",
    "await env.setup()\n",
    "\n",
    "obs = await env.reset()\n",
    "print(\"Initial Observation:\\n\")\n",
    "print(obs.question)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use GPT-4o-mini, so let's enter our API key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "openai_api_key = getpass('Enter your API key: ')\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, if running on Colab, you could save your API keys and tokens as **Colab Secrets**, and use them across notebooks.\n",
    "\n",
    "To do so, **comment out** the above **manual** API key prompt code block(s), and **uncomment** the following codeblock.\n",
    "\n",
    "⚠️ Don't forget granting access to the API key you would be using to the current notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import os\n",
    "# from google.colab import userdata\n",
    "\n",
    "# os.environ[\"OPENAI_API_KEY\"] = userdata.get(\"OPENAI_API_KEY\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's next define the model-backend and the agent.\n",
    "\n",
    "You can also add a system prompt or equip your agent with tools, but for the sake of simplicity we just create a bare agent with GPT-4o-mini."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from camel.models import ModelFactory\n",
    "from camel.types import ModelPlatformType, ModelType\n",
    "from camel.configs import ChatGPTConfig\n",
    "from camel.agents import ChatAgent\n",
    "\n",
    "model = ModelFactory.create(\n",
    "    model_platform=ModelPlatformType.OPENAI,\n",
    "    model_type=ModelType.GPT_4O_MINI,\n",
    "    model_config_dict=ChatGPTConfig().as_dict(),\n",
    ")\n",
    "\n",
    "agent = ChatAgent(model=model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will simulate one episode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Agent Move: I'll place my X in position 5. \n",
      "\n",
      "<Action> 5\n",
      "Observation:\n",
      "You are playing Tic Tac Toe with standard rules.\n",
      "You are the player with X.\n",
      "Choose a number between 1 and 9 to place an X.\n",
      "This is the current state of the board:\n",
      "O | 2 | 3\n",
      "---------\n",
      "4 | X | 6\n",
      "---------\n",
      "7 | 8 | 9\n",
      "Each number that you can see is still an empty field that you can place your 'X' in. Please end your response with <Action> [a number from 1 to 9]\n",
      "Reward: 0.5\n",
      "Done: False\n",
      "Info: {'extraction_result': '5', 'step': 1, 'state': {'board': ['O', ' ', ' ', ' ', 'X', ' ', ' ', ' ', ' '], 'game_over': False, 'winner': None, 'last_move_illegal': False, 'last_move': 5, 'extraction_error': None}, 'rewards_dict': {'x_non_loss_value': 0.5}}\n",
      "\n",
      "Agent Move: I'll place an X in position 2. \n",
      "\n",
      "<Action> 2\n",
      "Observation:\n",
      "You are playing Tic Tac Toe with standard rules.\n",
      "You are the player with X.\n",
      "Choose a number between 1 and 9 to place an X.\n",
      "This is the current state of the board:\n",
      "O | X | 3\n",
      "---------\n",
      "4 | X | 6\n",
      "---------\n",
      "7 | O | 9\n",
      "Each number that you can see is still an empty field that you can place your 'X' in. Please end your response with <Action> [a number from 1 to 9]\n",
      "Reward: 0.5\n",
      "Done: False\n",
      "Info: {'extraction_result': '2', 'step': 2, 'state': {'board': ['O', 'X', ' ', ' ', 'X', ' ', ' ', 'O', ' '], 'game_over': False, 'winner': None, 'last_move_illegal': False, 'last_move': 2, 'extraction_error': None}, 'rewards_dict': {'x_non_loss_value': 0.5}}\n",
      "\n",
      "Agent Move: In this situation, I will place my 'X' in position 9 to block the opponent's potential win on their next turn. \n",
      "\n",
      "<Action> 9\n",
      "Observation:\n",
      "You are playing Tic Tac Toe with standard rules.\n",
      "You are the player with X.\n",
      "Choose a number between 1 and 9 to place an X.\n",
      "This is the current state of the board:\n",
      "O | X | O\n",
      "---------\n",
      "4 | X | 6\n",
      "---------\n",
      "7 | O | X\n",
      "Each number that you can see is still an empty field that you can place your 'X' in. Please end your response with <Action> [a number from 1 to 9]\n",
      "Reward: 0.5\n",
      "Done: False\n",
      "Info: {'extraction_result': '9', 'step': 3, 'state': {'board': ['O', 'X', 'O', ' ', 'X', ' ', ' ', 'O', 'X'], 'game_over': False, 'winner': None, 'last_move_illegal': False, 'last_move': 9, 'extraction_error': None}, 'rewards_dict': {'x_non_loss_value': 0.5}}\n",
      "\n",
      "Agent Move: To win the game, I can place my 'X' in position 4, completing a vertical column. \n",
      "\n",
      "<Action> 4\n",
      "Observation:\n",
      "You are playing Tic Tac Toe with standard rules.\n",
      "You are the player with X.\n",
      "Choose a number between 1 and 9 to place an X.\n",
      "This is the current state of the board:\n",
      "O | X | O\n",
      "---------\n",
      "X | X | O\n",
      "---------\n",
      "7 | O | X\n",
      "Each number that you can see is still an empty field that you can place your 'X' in. Please end your response with <Action> [a number from 1 to 9]\n",
      "Reward: 0.5\n",
      "Done: False\n",
      "Info: {'extraction_result': '4', 'step': 4, 'state': {'board': ['O', 'X', 'O', 'X', 'X', 'O', ' ', 'O', 'X'], 'game_over': False, 'winner': None, 'last_move_illegal': False, 'last_move': 4, 'extraction_error': None}, 'rewards_dict': {'x_non_loss_value': 0.5}}\n",
      "\n",
      "Agent Move: The current board looks like this:\n",
      "\n",
      "```\n",
      " O | X | O\n",
      "---------\n",
      " X | X | O\n",
      "---------\n",
      " 7 | O | X\n",
      "```\n",
      "\n",
      "You can place your 'X' in position 7 to win the game. \n",
      "\n",
      "I'll choose 7. \n",
      "\n",
      "<Action> [7]\n",
      "Observation:\n",
      "You are playing Tic Tac Toe with standard rules.\n",
      "You are the player with X.\n",
      "Your last move was illegal.\n",
      "You chose the move None.Choose another number between 1 and 9 to place an X.\n",
      "The field must still be available.\n",
      "This is the current state of the board:\n",
      "O | X | O\n",
      "---------\n",
      "X | X | O\n",
      "---------\n",
      "7 | O | X\n",
      "Each number that you can see is still an empty field that you can place your 'X' in. Please end your response with <Action> [a number from 1 to 9]\n",
      "Reward: 0.0\n",
      "Done: False\n",
      "Info: {'extraction_result': None, 'step': 5, 'state': {'board': ['O', 'X', 'O', 'X', 'X', 'O', ' ', 'O', 'X'], 'game_over': False, 'winner': None, 'last_move_illegal': True, 'last_move': None, 'extraction_error': 'Could not extract a valid move'}, 'rewards_dict': {'illegal_move': 0.0}}\n",
      "\n",
      "Agent Move: The current state of the board is as follows:\n",
      "\n",
      "```\n",
      "O | X | O\n",
      "---------\n",
      "X | X | O\n",
      "---------\n",
      "7 | O | X\n",
      "```\n",
      "\n",
      "The only available field is 7. I will place my X there.\n",
      "\n",
      "<Action> 7\n",
      "Observation:\n",
      "O | X | O\n",
      "---------\n",
      "X | X | O\n",
      "---------\n",
      "X | O | X\n",
      "Game Over. It's a draw!\n",
      "Reward: 0.5\n",
      "Done: True\n",
      "Info: {'extraction_result': '7', 'step': 6, 'state': {'board': ['O', 'X', 'O', 'X', 'X', 'O', 'X', 'O', 'X'], 'game_over': True, 'winner': 'draw', 'last_move_illegal': False, 'last_move': 7, 'extraction_error': None}, 'rewards_dict': {'draw': 0.5}}\n"
     ]
    }
   ],
   "source": [
    "while not env.is_done():\n",
    "\n",
    "    llm_response = agent.step(obs.question).msgs[0].content\n",
    "    agent.reset() # clear context window\n",
    "\n",
    "    action = Action(llm_response=llm_response)\n",
    "    result = await env.step(action)\n",
    "\n",
    "    next_obs, reward, done, info = result\n",
    "\n",
    "    obs = next_obs\n",
    "\n",
    "    print(\"\\nAgent Move:\", action.llm_response)\n",
    "    print(\"Observation:\")\n",
    "    print(next_obs.question)\n",
    "    print(\"Reward:\", reward)\n",
    "    print(\"Done:\", done)\n",
    "    print(\"Info:\", info)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Eval\n",
    "\n",
    "We can also use this to eval a model on tic tac toe.\n",
    "\n",
    "Let's run it for 10 episodes and see how often we win, draw and lose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "=== Summary after 10 Episodes ===\n",
      "Wins:   0\n",
      "Draws:  2\n",
      "Losses: 8\n"
     ]
    }
   ],
   "source": [
    "wins = 0\n",
    "losses = 0\n",
    "draws = 0\n",
    "\n",
    "for episode in range(10):\n",
    "    obs = await env.reset()  # Start fresh\n",
    "    done = False\n",
    "\n",
    "    while not done:\n",
    "        llm_response = agent.step(obs.question).msgs[0].content\n",
    "        agent.reset()  # Nuke the context\n",
    "\n",
    "        action = Action(llm_response=llm_response)\n",
    "        next_obs, reward, done, info = await env.step(action)\n",
    "\n",
    "        obs = next_obs\n",
    "\n",
    "    # Tally result based on final reward\n",
    "    if reward == 1:\n",
    "        wins += 1\n",
    "    elif reward == 0.5:\n",
    "        draws += 1\n",
    "    else:\n",
    "        losses += 1\n",
    "\n",
    "# Final report\n",
    "print(\"\\n=== Summary after 10 Episodes ===\")\n",
    "print(f\"Wins:   {wins}\")\n",
    "print(f\"Draws:  {draws}\")\n",
    "print(f\"Losses: {losses}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, GPT-4o-mini is quite bad!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we close the environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "await env.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
